Image: 0.png
top 5 scores for 0.png: [0.20210139453411102, 0.20152655243873596, 0.19944915175437927, 0.19897283613681793, 0.1947309821844101]
# boxes before NMS (top_k_pre): 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: 1.png
top 5 scores for 1.png: [0.20436616241931915, 0.19330492615699768, 0.19257889688014984, 0.1855223923921585, 0.18518216907978058]
# boxes before NMS (top_k_pre): 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: 10.png
top 5 scores for 10.png: [0.19172115623950958, 0.17924265563488007, 0.17075741291046143, 0.1706967055797577, 0.16946221888065338]
# boxes before NMS (top_k_pre): 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: 11.png
top 5 scores for 11.png: [0.1782817542552948, 0.17269743978977203, 0.1700783520936966, 0.1692238599061966, 0.16636724770069122]
# boxes before NMS (top_k_pre): 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: 12.png
top 5 scores for 12.png: [0.20829883217811584, 0.1904813051223755, 0.18734796345233917, 0.17319230735301971, 0.16764065623283386]
# boxes before NMS (top_k_pre): 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: banana1.jpg
top 5 scores for banana1.jpg: [0.20490090548992157, 0.19734756648540497, 0.19688843190670013, 0.19362686574459076, 0.19353023171424866]
# boxes before NMS: 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: banana2.jpg
top 5 scores for banana2.jpg: [0.18971116840839386, 0.1891666054725647, 0.18829527497291565, 0.18690736591815948, 0.18597716093063354]
# boxes before NMS: 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: banana3.jpg
top 5 scores for banana3.jpg: [0.24207830429077148, 0.22727727890014648, 0.21902689337730408, 0.21292327344417572, 0.2118563950061798]
# boxes before NMS: 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: banana4.jpg
top 5 scores for banana4.jpg: [0.216048464179039, 0.21234364807605743, 0.20538221299648285, 0.20354531705379486, 0.20189881324768066]
# boxes before NMS: 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: banana5.jpg
top 5 scores for banana5.jpg: [0.3227527439594269, 0.27928754687309265, 0.26482123136520386, 0.2580227255821228, 0.2434408962726593]
# boxes before NMS: 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3

Image: banana6.jpg
top 5 scores for banana6.jpg: [0.2131563127040863, 0.20654091238975525, 0.20451699197292328, 0.18954293429851532, 0.18721458315849304]
# boxes before NMS: 50
# boxes after custom NMS: 3
# boxes after tv NMS: 3
In summary, my SSD-lite banana detector successfully learns to localize a single “banana” class on the D2L dataset. The training loss curves show clear convergence, and on both validation images and my own banana photos, the model usually produces reasonable, although slightly loose, bounding boxes around the fruit. When the scene looks similar to the training data (a single, reasonably large banana on a clean background), the detector typically draws one box that covers most of the banana, which shows that the backbone, anchors, and loss design are sufficient for this simple setting.
However, I also observe a number of clear failure cases that highlight the limitations of my approach. When the banana is very small in the image or placed near the edges, the detector often either misses it entirely or fires on a nearby background region instead. In more cluttered scenes (e.g., banana on a messy desk or next to other yellow objects), the model sometimes places boxes on textures or colors that resemble the banana rather than on the banana itself, suggesting that the learned features are relatively shallow and sensitive to color rather than higher-level shape. On some of my custom images, where the banana is partially occluded or at an unusual orientation, the predicted box can drift and only cover part of the fruit, or become oversized and include a large chunk of background.
These behaviors are consistent with the design choices I made: a lightweight SSD-lite architecture with a tiny backbone, a single low-resolution feature map, coarse anchors, and training from scratch on a small, single-class dataset. A heavier detector such as Faster R-CNN with a pre-trained backbone and multi-scale features would generally handle small objects, clutter, and pose variation more robustly, but at a higher computational cost. In that sense, the inaccuracies I observe are not just random errors. They are directly tied to the architectural trade-offs I made to keep the model simple and efficient.
For this project I implemented a standard greedy NMS routine that sorts boxes by score, repeatedly keeps the highest-scoring box, and removes all remaining boxes whose IoU with it exceeds a chosen threshold. I then compared it directly with torchvision.ops.nms by running both on the same decoded boxes and scores from my SSD-lite detector and visualizing the results. In practice, the two methods produced almost identical outputs on all test and personal images: the same 1–3 boxes were kept after suppression, and any differences were minor (usually just which of two nearly overlapping, similar-score boxes survived). This confirms that my implementation matches the behavior of the PyTorch reference.
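The greedy routine described above can be sketched as follows. This is a minimal NumPy version with variable names of my own choosing; the project code itself runs on decoded torch tensors, but the sort-keep-suppress logic is identical.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS sketch: repeatedly keep the highest-scoring box and
    drop all remaining boxes whose IoU with it exceeds the threshold.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))           # highest-scoring survivor
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the kept box with every remaining candidate
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # suppress heavy overlaps
    return keep
```

With the same boxes, scores, and threshold, this produces the same kept indices as torchvision.ops.nms (up to ties between equal-score boxes), which is exactly the agreement observed in the experiments above.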
Conceptually, NMS’s purpose is to clean up the dense set of overlapping anchor predictions and turn them into a small set of final detections (ideally one box per object). It does not make the model more accurate by itself; it only removes redundant boxes. The main limitations I observed are:
(1) NMS blindly trusts the model’s scores, so if the detector assigns higher scores to a wrong region (e.g., my hand or the cabinet instead of the banana), NMS will keep that wrong box.
(2) It cannot fix localization errors. If all high-scoring boxes are slightly off, then the final box will also be off.
(3) It depends on a hand-chosen IoU threshold, which can either suppress too many boxes or leave duplicates if set poorly.
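Limitation (3) is easy to demonstrate in isolation: whether a duplicate box survives depends purely on the hand-chosen threshold, not on anything the model learned. A tiny self-contained example (the boxes and scores below are invented for illustration, not taken from the detector):

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh):
    """Greedy NMS over plain Python lists, for the demo below."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes on one object plus one separate box.
boxes = [[0, 0, 10, 10], [2, 0, 12, 10], [30, 30, 40, 40]]
scores = [0.9, 0.85, 0.6]
# IoU(box 0, box 1) = 80 / 120 ~ 0.67
print(nms(boxes, scores, 0.5))  # stricter threshold: duplicate suppressed -> [0, 2]
print(nms(boxes, scores, 0.7))  # looser threshold: duplicate survives -> [0, 1, 2]
```

A threshold of 0.5 removes the duplicate, while 0.7 lets it through, so a poorly tuned threshold either leaves duplicate detections or, with crowded objects, suppresses correct boxes.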
For the HOI part, I used GPT as a vision–language model on three images: (1) a mounted police officer on a horse, (2) a motorcycle racer, and (3) a person reading a book. I prompted the model with: “List all human–object interactions in this image using the format